The posterior distribution of the number of components k in a
finite mixture satisfies a set of inequality constraints. The result holds
irrespective of the parametric form of the mixture components and
under assumptions on the prior distribution weaker than those routinely
made in the literature on Bayesian analysis of finite mixtures.

The inequality constraints can be used to perform an “internal” consistency
check of MCMC estimates of the posterior distribution of k
and to provide improved estimates which are required to satisfy the
constraints. Bounds on the posterior probability of k components are
derived using the constraints. Implications on prior distribution
specification and on the adequacy of the posterior distribution of k as a
tool for selecting an adequate number of components in the mixture
are also explored.


1. Introduction. Finite mixture distributions have received much attention
in the last decade, as a tool for modeling population heterogeneity and
especially as a conceptually simple way of relaxing distributional
assumptions. Undoubtedly the development of Markov chain Monte Carlo methods
has played an essential catalytic role. A survey of the theory and
applications of finite mixtures pre-MCMC is provided by Titterington, Smith and
Makov (1985), and a more recent introduction to the topic is Robert (1996).
Progress has been particularly evident in the Bayesian approach, where it
began with the Gibbs sampling algorithm of Diebolt and Robert (1994)
for estimating the parameters of a mixture with a fixed number of
components. Subsequent work has considered the number of components k as
an object of inference, either using tests to select an adequate number of
components or summarizing the uncertainty about it by reporting its
posterior distribution. Carlin and Chib (1995) and Raftery (1996) have proposed
using Bayes factors to test k against k + 1 components and they have
described MCMC methods to compute the necessary marginal likelihoods.
The paper by Raftery contains a summary of such methods. Mengersen
and Robert (1996) also assume a testing perspective, but use the
Kullback-Leibler divergence as a measure of distance between models with k and
k + 1 components. Nobile (1994), Phillips and Smith (1996), Richardson
and Green (1997), Roeder and Wasserman (1997) and Stephens (2000) have
put a prior distribution on the number of components and obtained MCMC
estimates of the posterior. Besides representing uncertainty about k, its
posterior distribution can also be used to mix models with different
numbers of components, leading to model mixing predictions of future
observables. Nobile (1994) attempted to estimate the marginal likelihoods of each
model separately and then formed an estimate of the posterior of k using
Bayes’ theorem. Roeder and Wasserman (1997) proposed to approximate
the marginal likelihoods using the Schwarz criterion. Although their
methods differ considerably, Phillips and Smith (1996), Richardson and Green
(1997) and Stephens (2000) share a common approach consisting of running
an MCMC sampler on a composite model, with jumps between submodels
that allow the sampler to change the number of components in the
mixture. Then the posterior of k can be estimated by the relative amount of
simulation time spent by the sampler in each submodel.

In this paper I show that, under some conditions on the prior distribution, 
the marginal likelihoods of finite mixture models with a different number 
of components satisfy a set of inequality constraints. Besides its theoretical 
interest, the result provides a means of performing a check of “internal”
consistency of MCMC estimates of the marginal likelihoods, or of the marginal
likelihoods implicit in MCMC estimates of the posterior of k. 

2. The model. Let $x = (x_1, \ldots, x_n)$ be a sequence of (possibly
vector-valued) random variables and assume that the $x_i$’s are independent and
identically distributed with probability density function (with respect to
some underlying measure) given by

$$ f(x_i) = \sum_{j=1}^{k} \lambda_j\, p_j(x_i). \tag{1} $$

Model (1) is called a “finite mixture distribution.” The mixture weights $\lambda_j$
are the probabilities that the random variable $x_i$ follows any of k alternative
distributions, with densities $p_j(\cdot)$ called the “mixture components.” In this
paper attention is restricted to the case where the number of components
k, the weights $\lambda_j$ and the components $p_j(\cdot)$ are all unknown. It is assumed,
however, that the densities $p_j(\cdot)$ belong to some specified parametric family,
which is allowed to vary with j. Thus $p_j(x_i) = p_j(x_i \mid \theta_j)$, where $\theta_j$ is the vector of
parameters of the jth mixture component.

As stated, model (1) is somewhat ambiguous, since the meaning of mixture
weights and mixture components is completely specified only when k
is fixed; for instance, the expression “the weight of the second component”
seems to have a different meaning when k = 2 than it has when k = 5. In
order to make explicit the dependence on k of mixture weights and
components, rewrite model (1) as follows:

$$ f(x_i \mid k, \lambda_k, \theta_k) = \sum_{j=1}^{k} \lambda_{jk}\, p_{jk}(x_i \mid \theta_{jk}), \qquad i = 1, \ldots, n, $$

where $\lambda_k = (\lambda_{1k}, \ldots, \lambda_{kk})^\top$ and $\theta_k = (\theta_{1k}, \ldots, \theta_{kk})^\top$. On occasion
$\lambda = (\lambda_1, \lambda_2, \ldots)^\top$ and $\theta = (\theta_1, \theta_2, \ldots)^\top$ will be used. In principle this
formulation allows the parametric family of the component to change with j and k.

Conditional on k, let $g_i$ be an integer in $\{1, \ldots, k\}$ denoting the unknown
component from which the ith observation $x_i$ proceeds. The unobserved
vector $g = (g_1, \ldots, g_n)^\top$ has been called the “membership vector” or “allocation
vector” or “configuration vector” of the mixture. If one conditions on g, the
distribution of $x_i$ is simply given by the $g_i$th component in the mixture,

$$ f(x \mid k, g, \theta_k) = \prod_{i=1}^{n} p_{g_i,k}(x_i \mid \theta_{g_i,k}). $$

The complete specification of the Bayesian finite mixture model requires
a prior distribution for all the unknown quantities. The prior on k, denoted
by $\pi(k)$, has support on (a subset of) the positive integers and may involve
a hyperparameter. Given k, the weights $\lambda_k = (\lambda_{1k}, \ldots, \lambda_{kk})^\top$ are assumed
to have the $\mathrm{Dir}(\alpha_{1k}, \ldots, \alpha_{kk})$ prior distribution, where the
hyperparameters $\alpha_k = (\alpha_{1k}, \ldots, \alpha_{kk})^\top$ are positive constants. Although other priors could
be used for the weights, the Dirichlet distribution has become a standard
choice. The allocations $g_i$ are conditionally independent given k and $\lambda_k$ with
$\Pr[g_i = j \mid k, \lambda_k] = \lambda_{jk}$. Given k, independent priors are usually assumed for
the component parameters $\theta_{jk}$,

$$ \pi(\theta_k \mid k, \phi_k) = \prod_{j=1}^{k} \pi_{jk}(\theta_{jk} \mid \phi_{jk}), $$

where $\phi_{jk}$ is the set of hyperparameters in the prior distribution of $\theta_{jk}$ and
$\phi_k = (\phi_{1k}, \ldots, \phi_{kk})^\top$. In general the components’ hyperparameters $\phi_{jk}$ can
vary with k, so that substantive prior information distinguishing the
components and depending on their number k can be accommodated. Similarly,
the functional form of the prior $\pi_{jk}(\cdot)$ may change with j and k, since the
component parametric family may too. Dependence on k is, however, ruled
out by the assumptions introduced in Section 3.

In summary, the joint distribution of the data and all unknowns in the
model is

$$ f(x, \theta, g, \lambda, k) = \pi(k)\, \pi(\lambda_k \mid k, \alpha_k)\, f(g \mid k, \lambda_k)\, \pi(\theta_k \mid k, \phi_k)\, f(x \mid k, g, \theta_k). \tag{2} $$


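In the sequel, attention is focused on a model obtained by integrating the
parameters $\lambda_k$ and $\theta_k$ out of model (2). Integrating the weights out of the
model yields

$$
\begin{aligned}
f(g \mid k, \alpha_k) &= \int f(g \mid k, \lambda_k)\, \pi(\lambda_k \mid k, \alpha_k)\, d\lambda_k \\
&= \int \frac{\Gamma(\alpha_{0k})}{\prod_{j=1}^{k} \Gamma(\alpha_{jk})} \prod_{j=1}^{k} \lambda_{jk}^{\alpha_{jk} + n_j - 1}\, d\lambda_k \\
&= \frac{\Gamma(\alpha_{0k})}{\Gamma(\alpha_{0k} + n)} \prod_{j=1}^{k} \frac{\Gamma(\alpha_{jk} + n_j)}{\Gamma(\alpha_{jk})},
\end{aligned}
\tag{3}
$$

where $\alpha_{0k} = \sum_{j=1}^{k} \alpha_{jk}$, $n_j = n_j(g) = \mathrm{card}\{A_j\}$ and $A_j = \{i : g_i = j\}$ is the
index set of the observations allocated to the jth component. One can also,
at least in principle, integrate the component parameters out of the model,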

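$$
\begin{aligned}
f(x \mid k, g, \phi_k) &= \int f(x \mid k, g, \theta_k)\, \pi(\theta_k \mid k, \phi_k)\, d\theta_k \\
&= \int \prod_{i=1}^{n} p_{g_i,k}(x_i \mid \theta_{g_i,k}) \prod_{j=1}^{k} \pi_{jk}(\theta_{jk} \mid \phi_{jk})\, d\theta_k \\
&= \prod_{j=1}^{k} \int \prod_{i \in A_j} p_{jk}(x_i \mid \theta_{jk})\, \pi_{jk}(\theta_{jk} \mid \phi_{jk})\, d\theta_{jk} \qquad (4) \\
&= \prod_{j=1}^{k} q_{jk}(x^j \mid \phi_{jk}), \qquad (5)
\end{aligned}
$$

where $x^j = \{x_i : i \in A_j\}$ comprises the observations that, according to the
membership vector g, are from the jth component and $q_{jk}(x^j \mid \phi_{jk})$ is a short
way of writing the integral in (4), that is, the marginal density of these
observations after the parameter $\theta_{jk}$ has been integrated out.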
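
As a rough numerical illustration of equation (3), the sketch below evaluates the closed form of f(g | k, alpha_k) with log-gamma functions and checks that it sums to one over all membership vectors. The function name `log_f_g` and the hyperparameter values are illustrative assumptions, not taken from the paper.

```python
from itertools import product
from math import lgamma, exp

def log_f_g(g, alpha):
    """log f(g | k, alpha_k) from equation (3); alpha lists alpha_1k..alpha_kk."""
    k, n = len(alpha), len(g)
    a0 = sum(alpha)
    # n_j = number of observations allocated to component j
    counts = [sum(1 for gi in g if gi == j) for j in range(1, k + 1)]
    out = lgamma(a0) - lgamma(a0 + n)
    for a_j, n_j in zip(alpha, counts):
        out += lgamma(a_j + n_j) - lgamma(a_j)
    return out

# Illustrative values (assumed): k = 3 components, n = 4 observations.
alpha = [1.0, 0.5, 2.0]
total = sum(exp(log_f_g(g, alpha)) for g in product(range(1, 4), repeat=4))
print(total)  # numerically close to 1, since f(g | k, alpha_k) is a pmf over g
```

Working on the log scale avoids overflow in the gamma functions when n is large.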

In the end the joint distribution of the data and unknowns is given by

$$ f(x, g, k \mid \phi, \alpha) = f(x \mid k, g, \phi_k)\, f(g \mid k, \alpha_k)\, \pi(k), \tag{6} $$

where $\phi = (\phi_1, \phi_2, \ldots)^\top$ and $\alpha = (\alpha_1, \alpha_2, \ldots)^\top$. Even though the $\phi$’s and $\alpha$’s
are fixed constants, I prefer, with a slight abuse of notation, to list them
explicitly to the right of the conditioning bars, as it is important to recall
that they enter in the expressions in (6). The posterior distribution of the
number of components is

$$ \pi(k \mid x, \phi, \alpha) \propto \pi(k)\, f(x \mid k, \phi_k, \alpha_k). $$

The marginal likelihoods $f(x \mid k, \phi_k, \alpha_k)$, which will also be denoted as $f_k$ for
short, are given by

$$ f_k = f(x \mid k, \phi_k, \alpha_k) = \sum_{g \in \mathcal{G}_k} f(x \mid k, g, \phi_k)\, f(g \mid k, \alpha_k), \qquad k = 1, 2, \ldots, \tag{7} $$

where the sum extends over the lattice $\mathcal{G}_k = \{g : g_i \in \{1, \ldots, k\},\ i = 1, \ldots, n\}$,
the set of membership vectors with components at most k.
Representation (7) demonstrates the great advantage of working with model (6) rather
than model (2). Using (7) it becomes possible to compare the contributions
of the same membership vector g to different $f_k$’s. This leads to linking
together the marginal likelihoods and deriving a set of linear inequalities
satisfied by them.

3. Linking the marginal likelihoods. In this section it is shown that, 
under certain conditions on the prior distribution, the marginal likelihoods 
fk in (7) satisfy a set of constraints. Intuitively, the approach will consist of 
breaking up the sum over Gk in (7) into many terms and then showing that 
some of them can be rewritten as sums over Gt with t < k. The following 
assumptions will be made throughout. 

Assumption A.1. The Dirichlet hyperparameter of any mixture weight
does not change with the number of components:

$$ \alpha_{jk} = \alpha_{jj}, \qquad j = 1, \ldots, k - 1,\ k = 2, 3, \ldots. $$

Assumption A.2. The properties of any mixture component (parametric
family and parameter prior distribution) do not change with the number
of components:

$$ p_{jk}(\cdot \mid \cdot) = p_{jj}(\cdot \mid \cdot), \qquad \pi_{jk}(\cdot \mid \cdot) = \pi_{jj}(\cdot \mid \cdot), \qquad \phi_{jk} = \phi_{jj}, $$

$$ j = 1, \ldots, k - 1,\ k = 2, 3, \ldots. $$

The assumptions impose a coherency requirement. Not only does the jth
component “remain the same” whether there are k or $k' < k$ components in
the mixture (Assumption A.2), but the probability distribution of the ratio
between the weight of the jth component and the sum of the weights of the
first $k'$ components also remains unchanged (Assumption A.1). Because of
Assumptions A.1 and A.2, when referring to a certain component one can
do so without specifying the number of components in the mixture.

Begin by noticing that the space of membership vectors $\mathcal{G}_k$ in (7) can be
partitioned as follows:

$$ \mathcal{G}_k = \bigcup_{t=1}^{k} \mathcal{G}_t^*, \qquad \mathcal{G}_t^* \cap \mathcal{G}_s^* = \emptyset, \quad t \neq s, \tag{8} $$

where $\mathcal{G}_t^*$ is the set of membership vectors that assign at least one
observation to the tth component and none to higher components:
$\mathcal{G}_t^* = \{g \in \mathcal{G}_t : \exists\, i \text{ s.t. } g_i = t\}$.


Definition 3.1. Let $f_t^*$ be the portion of $f_t$ that accounts for the
membership vectors g that allocate at least one observation to component t and
none to higher components (components lower than t may be empty),

$$ f_t^* = \sum_{g \in \mathcal{G}_t^*} f(x \mid t, g, \phi_t)\, f(g \mid t, \alpha_t). \tag{9} $$

Clearly $f_1^* = f_1$.

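In the sequel use will be made of the following conditions.

Condition C.1. For all $g \in \mathcal{G}_t^*$ with $t < k$,

$$ f(x \mid k, g, \phi_k) = f(x \mid t, g, \phi_t). $$

Condition C.2. For all $g \in \mathcal{G}_t^*$ with $t < k$,

$$ \frac{f(g \mid k, \alpha_k)}{f(g \mid t, \alpha_t)} = a_{kt}, \quad \text{a constant}. $$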


Lemma 3.1. Under Assumptions A.1 and A.2, the model of Section 2
satisfies Conditions C.1 and C.2 with

$$ a_{kt} = \frac{\Gamma(\alpha_{0k})}{\Gamma(\alpha_{0k} + n)}\, \frac{\Gamma(\alpha_{0t} + n)}{\Gamma(\alpha_{0t})}. \tag{10} $$


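Proof. To verify Condition C.1, recall (5): $f(x \mid k, g, \phi_k) = \prod_{j=1}^{k} q_{jk}(x^j \mid \phi_{jk})$.
All $g \in \mathcal{G}_t^*$, $t < k$, allocate no observations to components larger than the tth
one: $A_j = \emptyset$, $j > t$. Therefore the product in (5) extends from 1 to t only.
Moreover, Assumption A.2 implies that, for $j \in \{1, \ldots, t\}$, $q_{jk}(\cdot \mid \cdot) = q_{jt}(\cdot \mid \cdot)$
and $\phi_{jk} = \phi_{jt}$. Hence $f(x \mid k, g, \phi_k) = \prod_{j=1}^{t} q_{jt}(x^j \mid \phi_{jt}) = f(x \mid t, g, \phi_t)$. As for
Condition C.2, from (3) one has

$$ \frac{f(g \mid k, \alpha_k)}{f(g \mid t, \alpha_t)} = \frac{\Gamma(\alpha_{0k})}{\Gamma(\alpha_{0k} + n)} \prod_{j=1}^{k} \frac{\Gamma(\alpha_{jk} + n_j)}{\Gamma(\alpha_{jk})} \Bigg/ \frac{\Gamma(\alpha_{0t})}{\Gamma(\alpha_{0t} + n)} \prod_{j=1}^{t} \frac{\Gamma(\alpha_{jt} + n_j)}{\Gamma(\alpha_{jt})}. $$

Again, for all $g \in \mathcal{G}_t^*$ and $j > t$, $A_j = \emptyset$ so that $n_j = 0$. Hence the last $k - t$
terms in the product in the numerator are 1. Also, from Assumption A.1,
$\alpha_{jk} = \alpha_{jt}$, $j = 1, \ldots, t$. Therefore C.2 holds with $a_{kt}$ given by (10). □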
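
The statement of Lemma 3.1 can be checked numerically on a small case. The sketch below, under assumed illustrative hyperparameters, compares the ratio f(g | k, alpha_k) / f(g | t, alpha_t) computed from (3) with the constant a_kt from (10) for one membership vector that uses component t = 2 and none higher. The helper names `log_f_g` and `a` are hypothetical.

```python
from math import lgamma, exp

def log_f_g(g, alpha):
    """log f(g | k, alpha_k) from equation (3), with k = len(alpha)."""
    a0, n = sum(alpha), len(g)
    out = lgamma(a0) - lgamma(a0 + n)
    for j, a_j in enumerate(alpha, start=1):
        n_j = sum(1 for gi in g if gi == j)
        out += lgamma(a_j + n_j) - lgamma(a_j)
    return out

def a(k, t, alpha, n):
    """a_kt from (10); alpha[j-1] plays the role of alpha_jj (Assumption A.1)."""
    a0k, a0t = sum(alpha[:k]), sum(alpha[:t])
    return exp(lgamma(a0k) - lgamma(a0k + n) + lgamma(a0t + n) - lgamma(a0t))

alpha = [1.0, 0.5, 2.0, 1.5]   # assumed values of alpha_11, ..., alpha_44
g = (1, 2, 2, 1, 2)            # allocates to component 2 and none higher
ratio = exp(log_f_g(g, alpha[:4]) - log_f_g(g, alpha[:2]))
print(ratio, a(4, 2, alpha, len(g)))  # the two numbers should agree
```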

The following result may be considered as an appetizer.

Theorem 3.1. Let $f_k$ and $f_t^*$ be as in (7) and (9) and assume that
Conditions C.1 and C.2 hold. Then

$$ f_k = \sum_{t=1}^{k} a_{kt} f_t^*. \tag{11} $$

Moreover,

$$ f_k = a_{k,k-1} f_{k-1} + f_k^*. \tag{12} $$


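Proof. Equation (7) can be rewritten as
$f_k = \sum_{t=1}^{k} \sum_{g \in \mathcal{G}_t^*} f(x \mid k, g, \phi_k)\, f(g \mid k, \alpha_k)$
because of the partition of $\mathcal{G}_k$ in (8). Now use Conditions C.1 and
C.2 and the definition of $f_t^*$ in (9) to obtain (11). A little more algebra yields
(12):

$$
\begin{aligned}
f_k &= \sum_{t=1}^{k} a_{kt} f_t^* = f_k^* + \sum_{t=1}^{k-1} \frac{f(g \mid k, \alpha_k)}{f(g \mid t, \alpha_t)}\, f_t^* \\
&= f_k^* + \sum_{t=1}^{k-1} \frac{f(g \mid k, \alpha_k)}{f(g \mid k-1, \alpha_{k-1})}\, \frac{f(g \mid k-1, \alpha_{k-1})}{f(g \mid t, \alpha_t)}\, f_t^* \\
&= f_k^* + a_{k,k-1} \sum_{t=1}^{k-1} a_{k-1,t} f_t^* = a_{k,k-1} f_{k-1} + f_k^*.
\end{aligned}
$$

□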
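
Representations (11) and (12) can be verified by brute force on a toy model. The sketch below is only an illustration under stated assumptions: binary data, Bernoulli components with a Beta(1,1) prior on each success probability (so that the component marginal has a closed form), and arbitrary Dirichlet hyperparameters satisfying Assumption A.1. None of these choices, nor the function names, come from the paper.

```python
from itertools import product
from math import lgamma, exp, factorial

ALPHA = [1.0, 0.5, 2.0]   # assumed alpha_jj, shared across k (Assumption A.1)
X = [1, 0, 1, 1]          # toy binary data, n = 4

def f_g(g, k):
    """f(g | k, alpha_k) from equation (3)."""
    alpha = ALPHA[:k]
    a0, n = sum(alpha), len(g)
    out = lgamma(a0) - lgamma(a0 + n)
    for j in range(1, k + 1):
        out += lgamma(alpha[j - 1] + g.count(j)) - lgamma(alpha[j - 1])
    return exp(out)

def f_x(g, k):
    """f(x | k, g, phi_k) as in (5): product of Bernoulli/Beta(1,1) marginals."""
    out = 1.0
    for j in range(1, k + 1):
        xs = [x for x, gi in zip(X, g) if gi == j]
        m, s = len(xs), sum(xs)
        out *= factorial(s) * factorial(m - s) / factorial(m + 1)
    return out

def f(k, star=False):
    """f_k from (7); with star=True, the portion f_k^* from (9)."""
    total = 0.0
    for g in product(range(1, k + 1), repeat=len(X)):
        if star and k not in g:
            continue  # f_k^* keeps only vectors that use component k
        total += f_x(g, k) * f_g(g, k)
    return total

def a(k, t):
    """a_kt from (10)."""
    a0k, a0t, n = sum(ALPHA[:k]), sum(ALPHA[:t]), len(X)
    return exp(lgamma(a0k) - lgamma(a0k + n) + lgamma(a0t + n) - lgamma(a0t))

k = 3
lhs = f(k)
rhs11 = sum(a(k, t) * f(t, star=True) for t in range(1, k + 1))
rhs12 = a(k, k - 1) * f(k - 1) + f(k, star=True)
print(lhs, rhs11, rhs12)  # all three should coincide up to rounding
```

Enumeration of all membership vectors is feasible only for very small n and k; the point here is the identity, not an estimation method.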


Theorem 3.1 provides two representations of $f_k$. In (11) it is given as a
linear combination of the “no empty last component” portions of the marginal
likelihoods of models with 1, 2, ..., k components. In (12) it is written as
the “no empty last component” portion of the marginal likelihood of the
k-components model plus a fraction of the marginal likelihood of the model
with one fewer component. Much of the remainder of this section is devoted
to deriving a result stronger than Theorem 3.1. This is achieved by
exploiting additional symmetry left as yet untapped; some mixture components
may have identical characteristics. The first step consists in grouping the
mixture components into classes of “alike” components.


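Definition 3.2. Say that two mixture components j and k are alike or
equivalent if $\alpha_{jj} = \alpha_{kk}$, $p_{jj}(\cdot \mid \cdot) = p_{kk}(\cdot \mid \cdot)$, $\pi_{jj}(\cdot \mid \cdot) = \pi_{kk}(\cdot \mid \cdot)$ and $\phi_{jj} = \phi_{kk}$.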




A. NOBILE 


The above definition induces a partition of the components into classes of
equivalence, with two components being in the same class if they are alike.
It may help intuition to regard the observations as balls being placed in a
sequence of colored boxes, with boxes of the same color being equivalent.
Let $C(m)$ be the mth equivalence class and let $m.h$ be the index of the hth
smallest component in $C(m)$. The classes are ordered so that $C(m)$ precedes
$C(r)$ if $m.1 < r.1$. Each class contains either a finite number of components,
possibly one, or countably many components, possibly all. Let $N(t)$ be the
number of equivalence classes formed by components 1 through t. Also, let
$i(t)$ be the index of the equivalence class to which component t belongs, so
that $C(i(t))$ is the class of components that are equivalent to component t.
Finally, let $c(m, t)$ be the number of components in $C(m)$ that are no larger
than t and let $c(m)$ be its total number of components: $c(m) = \sup_t c(m, t)$.
One extreme case often considered in the literature is that of just one
equivalence class: there is no prior information distinguishing the components.
In this case $N(t) = 1$, $C(1) = \{1, 2, \ldots\}$, $1.h = h$, $i(t) = 1$, $c(1, t) = t$ and
$c(1) = \infty$. The other extreme case arises when each class contains only one
component: $N(t) = t$, $C(m) = \{m\}$, $m.1 = m$, $i(t) = t$, $c(m, t) = I(m \le t)$ and
$c(m) = 1$, with $I(\cdot)$ the indicator function.


Definition 3.3. For any membership vector $g \in \mathcal{G}_t^*$, define its class
occupancy pattern as the vector $h = (h_1, \ldots, h_{N(t)})^\top$, where $h_m$ is the number
of nonempty components in class $C(m)$.

Let $H_t : \mathcal{G}_t^* \to \{0, 1, 2, \ldots\}^{N(t)}$ be the mapping which associates to each
$g \in \mathcal{G}_t^*$ its class occupancy pattern h. Since the domain of $H_t$ is $\mathcal{G}_t^*$,
component t is nonempty, hence $h_{i(t)} \ge 1$; also, the number of nonempty
components cannot exceed the number of observations. Therefore, the range of the
mapping, $\mathcal{H}_t = H_t(\mathcal{G}_t^*)$, consists of the $N(t)$-dimensional vectors h satisfying

$$ \sum_{m=1}^{N(t)} h_m \le n, \qquad h_m \in \begin{cases} \{1, 2, \ldots, c(m, t)\}, & \text{if } m = i(t), \\ \{0, 1, \ldots, c(m, t)\}, & \text{otherwise.} \end{cases} \tag{13} $$

If $\sum_{m=1}^{N(t)} h_m < t$, some mixture components in $\{1, \ldots, t\}$ are empty. This
suggests that it may be possible to accommodate the class occupancy pattern
h using fewer than t components.


Definition 3.4. For any class occupancy h, let $s = s(h)$ be the smallest
integer such that the mixture components from 1 to s comprise at least $h_m$
components in $C(m)$, $m = 1, \ldots, \mathrm{card}(h)$,

$$ s = s(h) = \min\{r : c(m, r) \ge h_m,\ m = 1, \ldots, \mathrm{card}(h)\}, \tag{14} $$

where $\mathrm{card}(h)$ is the number of elements of h. If $h \in \mathcal{H}_t$, then $\mathrm{card}(h) = N(t)$
and $s \le t$.

The symbol s will be exclusively used to denote the function defined
in (14). For any $h \in \mathcal{H}_t$, s is the smallest number of components needed
to accommodate h, so that $h \in \mathcal{H}_s$ too, under the convention that trailing
0’s in h are dropped. For instance, suppose that $t = 6$, $C(1) \supseteq \{1, 2, 3, 6\}$,
$C(2) \supseteq \{4\}$, $C(3) \supseteq \{5\}$, so that $N(6) = 3$. If $h = (2, 1, 0)^\top$ then only three
components are nonempty and $s = 4$. Dropping the trailing 0 in h,
$h = (2, 1)^\top \in \mathcal{H}_4$.
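
The function s(h) of (14) is easy to compute once the equivalence classes are listed. The sketch below reproduces the example just given; the function name `s_of_h` and the representation of classes as Python lists (restricted to components 1 through 6) are assumptions made for illustration.

```python
def s_of_h(h, classes):
    """s(h) from (14): the smallest r such that components 1..r contain
    at least h[m] members of each class m."""
    # For a class with h_m > 0, the h_m-th smallest member must be included.
    needed = [sorted(cls)[hm - 1] for cls, hm in zip(classes, h) if hm > 0]
    return max(needed)

# Classes from the example in the text, restricted to components 1..6.
classes = [[1, 2, 3, 6], [4], [5]]
print(s_of_h((2, 1, 0), classes))  # 4
print(s_of_h((2, 1), classes))     # 4, trailing zero dropped
```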

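Definition 3.5. Let $\mathcal{H}_t^r = \{h \in \mathcal{H}_t : r = s(h)\}$ be the (possibly empty)
subset of class occupancies in $\mathcal{H}_t$ which can be accommodated with $r \le t$
components.

The set of class occupancies of the membership vectors in $\mathcal{G}_t^*$ can be
partitioned as follows:

$$ \mathcal{H}_t = \bigcup_{r=1}^{t} \mathcal{H}_t^r, \qquad \mathcal{H}_t^r \cap \mathcal{H}_t^q = \emptyset, \quad r \neq q. \tag{15} $$

If $h \in \mathcal{H}_t^r$, then $s(h) = r$ so that $h \in \mathcal{H}_r$ too, and hence $h \in \mathcal{H}_r^r$. This shows
that

$$ \mathcal{H}_t^r \subseteq \mathcal{H}_r^r, \qquad r \le t. \tag{16} $$

Definition 3.6. Let $\mathcal{G}_t^h$ with $t \ge s(h)$ be the subset of $\mathcal{G}_t^*$ consisting of
membership vectors with class occupancy pattern h: $\mathcal{G}_t^h = H_t^{-1}(h)$.

Clearly, $\{\mathcal{G}_t^h,\ h \in \mathcal{H}_t\}$ is a partition of $\mathcal{G}_t^*$:

$$ \mathcal{G}_t^* = \bigcup_{h \in \mathcal{H}_t} \mathcal{G}_t^h, \qquad \mathcal{G}_t^h \cap \mathcal{G}_t^{h'} = \emptyset, \quad h \neq h'. \tag{17} $$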
h£Ht 

Consider next the mapping $M_t : \mathcal{G}_t^h \to \mathcal{G}_t$ which removes any gap in the
sequence of nonempty components within each equivalence class. More
precisely, given $g \in \mathcal{G}_t^h$, let $j_{m1} < \cdots < j_{m,h_m}$ be the corresponding nonempty
components in $C(m)$, $m = 1, \ldots, N(t)$. The mapping $M_t$ changes, for all m,
the components $j_{m1}, \ldots, j_{m,h_m}$ into $m.1, \ldots, m.h_m$, respectively. Denote the
range of $M_t$ by $\mathcal{E}^h = M_t(\mathcal{G}_t^h)$, noting that from the definition of $\mathcal{G}_t^h$ it is
immediate that $M_t(\mathcal{G}_t^h) = M_r(\mathcal{G}_r^h)$ for any $t, r \ge s(h)$. The mapping $M_t$ does
not affect the class occupancy of a membership vector; thus $H_s(\mathcal{E}^h) = \{h\}$,
although in general $\mathcal{E}^h$ is a subset of $H_s^{-1}(h) = \mathcal{G}_s^h$. Because of the
equivalence of components within each class, the mapping $M_t$ leaves unchanged
$f(x, g \mid t, \phi_t, \alpha_t)$,

$$ f(x \mid t, g, \phi_t)\, f(g \mid t, \alpha_t) = f(x \mid t, \tilde{g}, \phi_t)\, f(\tilde{g} \mid t, \alpha_t), \qquad g \in \mathcal{G}_t^h,\ \tilde{g} = M_t(g). \tag{18} $$
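Definition 3.7. Let $\gamma_t^h$ be defined as follows:

$$ \gamma_t^h = \begin{cases} \dfrac{h_{i(t)}}{c(i(t), t)} \displaystyle\prod_{m=1}^{N(t)} \binom{c(m, t)}{h_m}, & h \in \mathcal{H}_t, \\[2ex] 0, & h \notin \mathcal{H}_t. \end{cases} \tag{19} $$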

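Lemma 3.2. Any element of $\mathcal{E}^h$ is the image under $M_t$ of $\gamma_t^h$ membership
vectors in $\mathcal{G}_t^h$.

Lemma 3.2 says that $\mathcal{G}_t^h$ consists of $\gamma_t^h$ subsets alike to $\mathcal{E}^h$, except for
which $h_m$ components in each class $C(m)$ are nonempty. Coupled with (18),
Lemma 3.2 gives

$$ \sum_{g \in \mathcal{G}_t^h} f(x, g \mid t, \phi_t, \alpha_t) = \gamma_t^h \sum_{g \in \mathcal{E}^h} f(x, g \mid t, \phi_t, \alpha_t). \tag{20} $$

Definition 3.8. Let $f_s^h$ be the portion of $f_s$, $s = s(h)$, that accounts for
the membership vectors in $\mathcal{G}_s^h$,

$$ f_s^h = \sum_{g \in \mathcal{G}_s^h} f(x \mid s, g, \phi_s)\, f(g \mid s, \alpha_s), \qquad s = s(h). \tag{21} $$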

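The following lemma is instrumental in proving the main result,
Theorem 3.2.

Lemma 3.3. The function $f_t^*$ defined in (9) can be rewritten as follows:

$$ f_t^* = \sum_{r=1}^{t} a_{tr} \sum_{h \in \mathcal{H}_t^r} \frac{\gamma_t^h}{\gamma_r^h}\, f_r^h. \tag{22} $$

Theorem 3.2. Suppose that Conditions C.1 and C.2 are verified. Then

$$ f_k = \sum_{r=1}^{k} a_{kr} \sum_{h \in \mathcal{H}_r^r} \beta_k^h f_r^h, \tag{23} $$

where $f_r^h$ is defined in (21),

$$ \beta_k^h = \frac{1}{\gamma_r^h} \sum_{t=r}^{k} \gamma_t^h \tag{24} $$

and $\gamma_t^h$ is given in (19). Moreover,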
It is worthwhile to consider explicitly the cases where all components are
equivalent and where no two components are equivalent.

Proposition 3.1. Suppose that Conditions C.1 and C.2 are satisfied
and that all mixture components are equivalent. Then

$$ f_k = \sum_{h=1}^{k \wedge n} \binom{k}{h} a_{kh} f_h^0 \tag{26} $$

$$ = a_{k,k-1} f_{k-1} + \sum_{h=1}^{k \wedge n} \binom{k-1}{h-1} a_{kh} f_h^0. \tag{27} $$

Proof. Recall that if all the components are equivalent then $N(t) = 1$,
$c(1, t) = t$ and $i(t) = 1$. Therefore the class occupancy h is a scalar, the
number of nonempty components in the unique equivalence class. From formula
(13) the range of h is $\mathcal{H}_t = \{1, \ldots, t \wedge n\}$, with $t \wedge n = \min(t, n)$. From
Definition 3.4 the smallest number of components needed to accommodate h is
$s(h) = h$. Hence $\mathcal{H}_t^r = \{r\}$, $r \le t \wedge n$, and $\mathcal{H}_t^r = \emptyset$, $r > t \wedge n$. Here the range
of $M_t$ is $\mathcal{E}^h = \mathcal{G}_h^0$, the subset of $\mathcal{G}_h$ consisting of membership vectors that
allocate at least one observation to each component, while (21) gives the
part of the marginal likelihood $f_h$ corresponding to no empty components,

$$ f_h^0 = \sum_{g \in \mathcal{G}_h^0} f(x \mid h, g, \phi_h)\, f(g \mid h, \alpha_h). \tag{28} $$

In this case expression (23) becomes $f_k = \sum_{h=1}^{k \wedge n} a_{kh} \beta_k^h f_h^0$. From (19) one has
$\gamma_t^h = \binom{t-1}{h-1}$ so that $\beta_k^h = \sum_{t=h}^{k} \binom{t-1}{h-1} = \binom{k}{h}$ and (26) follows. Equation (27)
can be derived from (25) after making substitutions similar to the ones
performed to obtain (26). □

Formula (26) provides a representation of the marginal likelihood of k
components as a linear combination of the portions of marginal likelihoods
corresponding to no empty components.
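
The binomial coefficient in (26) can be understood as a sum of smaller binomial coefficients. The short check below assumes the reading that, with all components equivalent, the count in (19) reduces to C(t-1, h-1) and that the weight in (24) is the sum of these counts for t = h, ..., k; under that assumption the sum equals C(k, h). The function name `beta` is hypothetical.

```python
from math import comb

def beta(k, h):
    """Sum over t = h..k of C(t-1, h-1), the assumed all-equivalent form of (24)."""
    return sum(comb(t - 1, h - 1) for t in range(h, k + 1))

# Compare with the binomial coefficient C(k, h) appearing in (26).
checks = [(k, h, beta(k, h), comb(k, h))
          for k in range(1, 9) for h in range(1, k + 1)]
print(all(b == c for _, _, b, c in checks))
```

The identity has a direct combinatorial reading: choosing which h of the k equivalent components are nonempty can be organized by the index t of the largest nonempty one.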



Proposition 3.2. Suppose that Conditions C.1 and C.2 hold and that
no two mixture components are equivalent. Then

$$ f_k = \sum_{t=1}^{k} a_{kt} f_t^* \tag{29} $$

$$ = a_{k,k-1} f_{k-1} + f_k^*. \tag{30} $$

The proof is left as an exercise for the interested reader.

Note that the conclusion of Proposition 3.2 coincides with that of
Theorem 3.1: if no two components are equivalent, there is no additional
symmetry to be exploited beyond what is assumed by Theorem 3.1. The following
corollary summarizes some special cases.


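Corollary 3.1. For the model of Section 2, under Assumptions A.1 and A.2,
one has the following:

(i) representations (23) and (25) hold with $a_{kr}$ as given in (10);

(ii) in the special case where all mixture components are equivalent with
the Dirichlet prior on the mixture weights having hyperparameter $\alpha_{jk} = \alpha$,
one has

$$ f_k = \sum_{h=1}^{k \wedge n} \binom{k}{h} \frac{\Gamma(k\alpha)}{\Gamma(k\alpha + n)}\, \frac{\Gamma(h\alpha + n)}{\Gamma(h\alpha)}\, f_h^0 \tag{31} $$

$$ = \left[ \prod_{i=1}^{n} \frac{k\alpha - \alpha - 1 + i}{k\alpha - 1 + i} \right] f_{k-1} + \sum_{h=1}^{k \wedge n} \binom{k-1}{h-1} \frac{\Gamma(k\alpha)}{\Gamma(k\alpha + n)}\, \frac{\Gamma(h\alpha + n)}{\Gamma(h\alpha)}\, f_h^0; \tag{32} $$

(iii) in case (ii) above with $\alpha = 1$ one has

$$
\begin{aligned}
f_k &= \sum_{h=1}^{k \wedge n} \frac{k!}{h!\,(k-h)!}\, \frac{(k-1)!}{(k-1+n)!}\, \frac{(h-1+n)!}{(h-1)!}\, f_h^0 \\
&= \frac{k-1}{k+n-1}\, f_{k-1} + \sum_{h=1}^{k \wedge n} \frac{(k-1)!}{(h-1)!\,(k-h)!}\, \frac{(k-1)!}{(k-1+n)!}\, \frac{(h-1+n)!}{(h-1)!}\, f_h^0.
\end{aligned}
$$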

The representations of the marginal likelihoods $f_k$ provided in Theorems
3.1 and 3.2 and their corollaries lead to a set of linear constraints on the $f_k$’s.
Solving the triangular system (11) for the $f_t^*$’s in terms of the $f_k$’s, one
obtains, from (12), $f_k^* = f_k - a_{k,k-1} f_{k-1}$. As the $f_t^*$’s are, from equation (9), sums
of strictly positive terms, this implies that

$$ f_k > a_{k,k-1} f_{k-1}. \tag{33} $$

The constraints (33) hold no matter how the mixture components partition
into classes of equivalence. In the case of no equivalent components treated
in Proposition 3.2, the constraints (33) cannot be made any stronger, since
by how much $f_k$ exceeds $a_{k,k-1} f_{k-1}$, that is, $f_k^*$, depends on vectors which
allocate at least one observation to component k. At the opposite extreme of
all equivalent components, dealt with in Proposition 3.1, stronger constraints
are obtained by solving the triangular system (26) for the $f_h^0$’s in terms of the
$f_k$’s, and then setting the solution to be positive. These constraints, explicitly
derived in formula (36), are stronger than (33) because, of all the $f_h^0$’s in the
sum in (27), only $f_k^0$ involves vectors allocating observations to the kth
component. As a very special case, consider equation (26) with $k > n$. Then
$f_k$ is a linear combination of $f_1^0, \ldots, f_n^0$. However, $f_1^0, \ldots, f_n^0$ can be obtained
by solving (26) with $k = 1, \ldots, n$. Therefore, $f_k$ with $k > n$ is completely
determined by the marginal likelihoods $f_1, \ldots, f_n$; this is a much stronger
result than is obtainable when no components are equivalent. The general
case where only some components are equivalent is covered by Theorem 3.2.
As usual the constraints (33) hold, but, contrary to the case of all equivalent
components, one cannot solve system (23) for the $f_r^h$’s. Nevertheless, there
might be a function of the $f_r^h$’s, finer than $f_t^*$ is, such that system (23) can
be solved for it.
The remainder of this section deals exclusively with the case where all
mixture components are equivalent. The triangular system (26) with
$k = 1, \ldots, n$ can be rewritten as

$$ f_k = f_k^0 + \sum_{t=1}^{k-1} b_{kt} f_t^0, \qquad k \le n, \tag{34} $$

with $b_{kt} = \binom{k}{t} a_{kt}$. Denote by $B_n$ the matrix of coefficients of system (34).
In this case one can provide a simple explicit expression for the elements of
$B_n^{-1}$. The following lemma is needed.

Lemma 3.4. Consider the q-dimensional unit lower triangular matrix
$B = \{b_{kt}\}$ with $b_{kt} = \binom{k}{t} a_{kt}$, $a_{kt}$ as in Condition C.2. Let C be the unit
lower triangular matrix with generic element $c_{kt} = (-1)^{k+t} \binom{k}{t} a_{kt}$. Then
$B^{-1} = C$.


Proposition 3.3. Suppose that Conditions C.1 and C.2 are satisfied
and that all mixture components are equivalent. Then

$$ f_k^0 = f_k + \sum_{t=1}^{k-1} (-1)^{k+t} \binom{k}{t} a_{kt} f_t, \qquad k \le n. \tag{35} $$

Proof. The matrix $B_n$ of the coefficients of system (34) is unit lower
triangular with generic element $b_{kt} = \binom{k}{t} a_{kt}$. From Lemma 3.4, the inverse
$B_n^{-1}$ has generic element $b^{kt} = (-1)^{k+t} \binom{k}{t} a_{kt}$, $k \ge t$, and the result follows.
□
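
The inverse in Lemma 3.4 can be checked in exact arithmetic. The sketch below assumes all Dirichlet hyperparameters equal to 1, so that a_kt in (10) reduces to a ratio of factorials, and takes b_kt = C(k,t) a_kt and c_kt = (-1)^(k+t) C(k,t) a_kt as the entries of B and C. The sizes q and n are arbitrary illustrative choices.

```python
from fractions import Fraction
from math import comb, factorial

def a(k, t, n):
    """a_kt from (10) with every alpha_jk = 1, so that alpha_0k = k."""
    return Fraction(factorial(k - 1) * factorial(t - 1 + n),
                    factorial(k - 1 + n) * factorial(t - 1))

def matrices(q, n):
    """Unit lower triangular B and C of Lemma 3.4, as nested lists."""
    B = [[comb(k, t) * a(k, t, n) if t <= k else Fraction(0)
          for t in range(1, q + 1)] for k in range(1, q + 1)]
    C = [[(-1) ** (k + t) * comb(k, t) * a(k, t, n) if t <= k else Fraction(0)
          for t in range(1, q + 1)] for k in range(1, q + 1)]
    return B, C

q, n = 5, 7   # assumed sizes, chosen only for illustration
B, C = matrices(q, n)
prod = [[sum(B[i][j] * C[j][l] for j in range(q)) for l in range(q)]
        for i in range(q)]
identity = [[Fraction(int(i == l)) for l in range(q)] for i in range(q)]
print(prod == identity)
```

Using `Fraction` avoids any rounding, so the comparison with the identity matrix is exact.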


The following corollary follows immediately from Proposition 3.3 and
summarizes some special cases.

Corollary 3.2. For the model of Section 2, under Assumptions A.1 and
A.2, one has the following:

(i) if all components are equivalent with Dirichlet prior on the weights
having hyperparameter $\alpha$, then

$$ f_k^0 = \sum_{t=1}^{k} (-1)^{k+t} \binom{k}{t} \frac{\Gamma(k\alpha)}{\Gamma(k\alpha + n)}\, \frac{\Gamma(t\alpha + n)}{\Gamma(t\alpha)}\, f_t, \qquad k \le n; $$

(ii) in case (i) above with $\alpha = 1$,

$$ f_k^0 = \sum_{t=1}^{k} (-1)^{k+t} \frac{k!}{t!\,(k-t)!}\, \frac{(k-1)!}{(k-1+n)!}\, \frac{(t-1+n)!}{(t-1)!}\, f_t, \qquad k \le n. $$

Briefly returning to the topic of the inequality constraints on the $f_k$’s,
from (35) one has

$$ f_k > \sum_{t=1}^{k-1} (-1)^{k+t+1} \binom{k}{t} a_{kt} f_t, \qquad k \le n. \tag{36} $$

The following section discusses possible uses of these constraints; the present
one concludes by addressing the problem of expressing $f_k$ with $k > n$ in
Proposition 3.1 in terms of $f_1, \ldots, f_n$.


Proposition 3.4. Suppose that Conditions C.1 and C.2 are satisfied
and that all mixture components are equivalent. Then

$$ f_k = \sum_{t=1}^{n} (-1)^{n+t} \binom{k}{t} \binom{k-t-1}{n-t} a_{kt} f_t, \qquad k > n. \tag{37} $$

4. Applications. This section explores some uses of the representations
of the marginal likelihoods derived in Section 3.

1. When all mixture components are equivalent, a proper prior on the
number of components is necessary in order to have a proper posterior.

2. Bounds on the posterior probability of k mixture components can be
derived that hold for any sample of given size and for any family of
component distributions.

3. An “internal” consistency check of Markov chain Monte Carlo estimates
of the marginal likelihoods $f(x \mid k)$ can be performed by verifying that
they satisfy the constraints. Estimates that fail the check can seemingly
be improved by modifying them so that the constraints are satisfied.

4. Expressions can be obtained for the prior and posterior distributions of
the number of nonempty components in the mixture, that is, the number
of components to which observations are allocated.
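
The consistency check in item 3 can be sketched for the weakest constraints, (33). Since the marginal likelihoods implicit in a posterior estimate are proportional to the ratio of posterior to prior probabilities, and (33) is unaffected by a common scale factor, the check needs only those ratios. The sketch below assumes all components equivalent with common Dirichlet hyperparameter alpha; the posterior estimates are hypothetical numbers, not results from the paper, and the function names are illustrative.

```python
from math import lgamma, exp

def a(k, t, n, alpha=1.0):
    """a_kt from (10) when every alpha_jk equals alpha (alpha_0k = k * alpha)."""
    return exp(lgamma(k * alpha) - lgamma(k * alpha + n)
               + lgamma(t * alpha + n) - lgamma(t * alpha))

def violations(post, prior, n, alpha=1.0):
    """Return the values of k whose implied f_k fails f_k > a_{k,k-1} f_{k-1}."""
    f = {k: post[k] / prior[k] for k in post}   # f_k up to a common constant
    return [k for k in sorted(f) if k - 1 in f
            and f[k] <= a(k, k - 1, n, alpha) * f[k - 1]]

# Hypothetical MCMC estimates of the posterior of k under a uniform prior.
prior = {k: 1 / 5 for k in range(1, 6)}
post = {1: 0.02, 2: 0.55, 3: 0.40, 4: 0.0299, 5: 0.0001}
print(violations(post, prior, n=82))
```

In this made-up example only k = 5 is flagged: its estimated posterior probability is too small relative to that of k = 4 to be compatible with (33). The stronger constraints (36) could be checked in the same way.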
Throughout this section attention is focused on the case where all
mixture components are equivalent, for a variety of reasons: it is important in
practice, it is amenable to a notationally simpler treatment and it leads to
stronger results. In order to lighten the notation, the explicit indication of
the hyperparameters is abandoned in most of this section, so, for instance, I
will write $\pi(k \mid x)$ and $f(x \mid k)$ in place of $\pi(k \mid x, \phi, \alpha)$ and $f(x \mid k, \phi_k, \alpha_k)$.
Fortran and S-PLUS programs used for the computations in this section are
available from the author upon request.

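4.1. Proper posterior of k. From Bayes’ theorem, the posterior
distribution of the number of components is

$$ \pi(k \mid x) = \frac{\pi(k)\, f(x \mid k)}{\sum_{j=1}^{\infty} \pi(j)\, f(x \mid j)} = \frac{\pi(k) \sum_{h=1}^{k \wedge n} \binom{k}{h} a_{kh} f_h^0}{\sum_{j=1}^{\infty} \pi(j) \sum_{h=1}^{j \wedge n} \binom{j}{h} a_{jh} f_h^0}, \tag{38} $$

where the representation of the marginal likelihoods given in (26) was used.
Since the series in the denominator of (38) is of positive terms, one can
change the order of summation to obtain

$$ \pi(k \mid x) = \frac{\pi(k) \sum_{h=1}^{k \wedge n} \binom{k}{h} a_{kh} f_h^0}{\sum_{h=1}^{n} f_h^0 \left\{ \sum_{j=h}^{\infty} \pi(j) \binom{j}{h} a_{jh} \right\}}. \tag{39} $$

A proper prior distribution $\pi(k)$ ensures that the posterior is also a proper
probability distribution. The following theorem shows that, when all mixture
components are equivalent, this condition is not only sufficient but also
necessary.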

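Theorem 4.1. Consider the model of Section 2, under Assumptions
A.1 and A.2, and suppose that all mixture components are equivalent. Then
the posterior $\pi(k \mid x)$ of the number of components is a proper distribution if
and only if the prior $\pi(k)$ is proper.

Proof. The posterior $\pi(k \mid x)$ is proper if and only if each series in braces
in the denominator of (39) converges. Using formula (10) for $a_{jh}$, these series
become

$$ \frac{\Gamma(\alpha_{0h} + n)}{\Gamma(\alpha_{0h})\, h!} \sum_{j=h}^{\infty} \pi(j)\, \frac{j!}{(j-h)!}\, \frac{\Gamma(\alpha_{0j})}{\Gamma(\alpha_{0j} + n)}, \qquad h = 1, \ldots, n. $$

Clearly, if the above series converges when $h = n$, it also converges for $h < n$.
Thus $\pi(k \mid x)$ is proper if and only if the following series converges:

$$ \sum_{j=n}^{\infty} \pi(j)\, \frac{j!}{(j-n)!}\, \frac{\Gamma(\alpha_{0j})}{\Gamma(\alpha_{0j} + n)} = \sum_{j=n}^{\infty} \pi(j) \prod_{i=0}^{n-1} \frac{j - i}{\alpha_{0j} + i}. \tag{40} $$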
Since all components are equivalent, $\alpha_{0j} = j\alpha$ for some $\alpha > 0$. Letting $c_j$
denote the generic term of series (40), it is easy to see that

$$ \frac{\pi(j)}{(n\alpha + n - 1)^n} \le c_j \le \frac{\pi(j)}{\alpha^n}. \tag{41} $$

To prove the right-hand side inequality note that each of the n terms in
the product in (40) is no larger than $1/\alpha$. For the left-hand side inequality,
note that each of those terms is no smaller than $(j - n + 1)/(j\alpha + n - 1)$, which
in turn is no smaller than $1/(n\alpha + n - 1)$. If the prior on k is proper, it
then follows, from the right-hand side inequality of (41) and the comparison
test for series, that the posterior is also proper. Similarly, if the prior is not
proper, the posterior is also seen to be improper, by an application of the
comparison test to the left-hand side inequality of (41). □


4.2. Bounds on the posterior of k. In this subsection it is assumed that
the prior distribution on the number of components is proper. A bound on
$\pi(k \mid x)$ results from the maximization of the right-hand side of (39) with
respect to $\{f_h^0\}_{h=1}^{n}$ subject to $f_h^0 \ge 0$. The following result simplifies
computations.

Proposition 4.1. Among the vectors that maximize the right-hand side
of (39) there is at least one vector $\{f_h^0\}_{h=1}^{n}$ with only one nonzero
component $f_t^0$, with $t \in \{1, \ldots, k \wedge n\}$.

Note that the nonzero component of the maximizer in Proposition 4.1
need not be the $(k \wedge n)$th. Also, note that, as a function of $\{f_h^0\}_{h=1}^{n}$, the
right-hand side of (39) is constant over lines through the origin; that is, it
is homogeneous of zero degree, so that in computing it one can set $f_t^0 = 1$.
Proposition 4.1 restricts the set of vectors $\{f_h^0\}_{h=1}^{n}$ one has to compute to
find a maximizer of (39) to the $k \wedge n$ vectors with all but one component
equal to 0; one can simply compute the right-hand side of (39) for each of
them and then pick the one that yields the maximum value.

The bound thus obtained holds, whatever the distributional form of the
components in the mixture, as long as they are all equivalent. Moreover, it
only depends on the data through the sample size n. As an example, consider
the posterior of k for a sample of size $n = 82$, with a discrete uniform prior
on k over $\{1, \ldots, k_{\max} = 30\}$ and $\alpha_{jk} = \alpha = 1$ for all j, k. A maximizer of
(39) with $k = 3$ is $f_3^0 = 1$, $f_h^0 = 0$, $h \neq 3$. The posterior of k corresponding to
the maximizer is reported in Table 1. The bound is $\pi(3 \mid x) \le 0.8623$. These
numerical results remain essentially unchanged for any discrete uniform prior
with $k_{\max} \ge 10$.
Table 1

Posterior distribution of k which gives maximum probability to k = 3, assuming that
n = 82, $\pi(k) = 1/k_{\max}$, $k = 1,\ldots,k_{\max} = 30$ and $\alpha = 1$

k            1   2   3        4        5              6              7              8              9
$\pi(k|x)$   0   0   0.8623   0.1217   1.42 × 10^-2   1.63 × 10^-3   1.94 × 10^-4   2.44 × 10^-5   3.26 × 10^-6
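The search described after Proposition 4.1 is short enough to sketch in code. The block below is an illustration rather than the author's implementation: it assumes equivalent components with a common hyperparameter $\alpha$, takes $a_{kt} = \Gamma(k\alpha)\Gamma(t\alpha+n)/\{\Gamma(k\alpha+n)\Gamma(t\alpha)\}$ (the form that appears in (48)), and uses hypothetical function names. Setting $f_t^* = 1$ and all other components to 0 makes the right-hand side of (39) proportional to $\pi(k)\binom{k}{t}a_{kt}$ for $k \ge t$, and the bound is the largest value obtained over $t = 1,\ldots,k \wedge n$.

```python
from math import comb, exp, lgamma


def a(k, t, n, alpha):
    """a_{kt} = Gamma(k*alpha) Gamma(t*alpha + n) / (Gamma(k*alpha + n) Gamma(t*alpha))."""
    return exp(lgamma(k * alpha) + lgamma(t * alpha + n)
               - lgamma(k * alpha + n) - lgamma(t * alpha))


def posterior_single_component(t, n, alpha, prior):
    """Posterior of k from (39) when f*_t = 1 and all other f*_h are 0.

    `prior` maps each k to pi(k); the result is proportional to
    pi(k) * C(k, t) * a_{kt} for k >= t and is 0 for k < t.
    """
    weights = {k: (p * comb(k, t) * a(k, t, n, alpha) if k >= t else 0.0)
               for k, p in prior.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def bound(k, n, alpha, prior):
    """Largest pi(k|x) over the single-component vectors of Proposition 4.1."""
    return max(posterior_single_component(t, n, alpha, prior)[k]
               for t in range(1, min(k, n) + 1))


def uniform_prior(k_max):
    """Discrete uniform prior on {1, ..., k_max}."""
    return {k: 1.0 / k_max for k in range(1, k_max + 1)}


if __name__ == "__main__":
    # Setting of Table 1: n = 82, uniform prior with k_max = 30, alpha = 1.
    table1 = posterior_single_component(3, 82, 1.0, uniform_prior(30))
    print({k: round(p, 4) for k, p in table1.items() if k <= 6})
    print(round(bound(3, 82, 1.0, uniform_prior(30)), 4))
```

Under these assumptions the sketch agrees with the 0.8623 bound of Table 1 and with the first entries of Tables 2 and 3; working on the log scale through `lgamma` keeps the Gamma ratios stable for sample sizes such as n = 500.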


Table 2 contains bounds on $\pi(k|x)$ for several values of k and n, under a
uniform prior on k over $\{1,\ldots,k_{\max} = 50\}$ and $\alpha = 1$. Tables 3 and 4 contain
bounds when $\alpha = 2$ and $\alpha = 0.5$, respectively.

Tables 2-4 are still correct, at the reported precision, for any discrete
uniform prior on k with $k_{\max} > 50$. Since the bounds involve the data only
through the sample size n, they provide a glimpse of the strength of the
prior distribution. Thus, it is to be expected that, for fixed k, the bounds
become weaker as sample size increases. Perhaps less obvious is that, for
fixed sample size, the bounds become stronger as k increases. An intuitive
explanation is as follows. Suppose that the model with k components has
considerable posterior mass. The posterior mass of the model with k + 1
components is at least in part due to the k + 1 copies of $\mathcal{G}_k$ embedded in
$\mathcal{G}_{k+1}$, all corresponding to at least one empty component. How large this
part is depends on the prior distribution, but it may well increase with k
since the larger space contains k + 1 copies of the smaller one. The values
of the hyperparameters $\alpha_{jk} = \alpha$ also greatly affect the bounds, as one can
see by comparing Tables 2-4. Increasing $\alpha$ leads to Dirichlet distributions
that make very small mixture weights less probable. In turn this reduces
the probability mass assigned by the prior on g to membership vectors with
empty components. The effect is to “loosen” the link between the marginal
likelihoods of different numbers of components, thus making the bounds
weaker. Therefore, a more informative prior on the mixture weights leads to
weaker constraints on the posterior of k.


Table 2

Bounds on $\pi(k|x)$ for several sample sizes n, $\pi(k) = 1/k_{\max}$, $k = 1,\ldots,k_{\max} = 50$, $\alpha = 1$

                                            k
  n       1       2       3       4       5       6       7       8       9       10
  20    0.9000  0.7286  0.5299  0.3456  0.2880  0.2419  0.1954  0.1756  0.1505  0.1335
  50    0.9600  0.8847  0.7826  0.6645  0.5414  0.4233  0.3175  0.3119  0.2835  0.2402
 100    0.9800  0.9412  0.8858  0.8170  0.7385  0.6541  0.5677  0.4828  0.4023  0.3322
 500    0.9960  0.9880  0.9762  0.9607  0.9417  0.9193  0.8938  0.8656  0.8350  0.8022
Table 3

Bounds on $\pi(k|x)$ for several sample sizes n, $\pi(k) = 1/k_{\max}$, $k = 1,\ldots,k_{\max} = 50$, $\alpha = 2$

                                            k
  n       1       2       3       4       5       6       7       8       9       10
  20    0.9756  0.8976  0.7636  0.5932  0.4168  0.2958  0.2718  0.2084  0.1915  0.1554
  50    0.9956  0.9797  0.9473  0.8963  0.8268  0.7414  0.6447  0.5426  0.4411  0.3459
 100    0.9989  0.9945  0.9852  0.9695  0.9465  0.9156  0.8766  0.8299  0.7762  0.7167
 500    1.0000  0.9998  0.9993  0.9986  0.9975  0.9958  0.9937  0.9908  0.9873  0.9830


4.3. Estimation. In Section 3 the set of constraints (36) on the marginal
likelihoods was derived for the case where all components are equivalent.
These constraints can be used to perform a check of internal consistency
of Markov chain Monte Carlo estimates of the marginal likelihoods $f(x|k)$,
or of the marginal likelihoods implied by MCMC estimates of the posterior
of k. The easiest way to check whether the constraints (36) are satisfied is
to compute the $f_k^*$ in (35) and see whether they are positive. As an example,
Richardson and Green (1997) estimate a Bayesian mixture of univariate
normals for the galaxy data set. They assume that all mixture components
are equivalent, the prior on k is $\pi(k) = 1/k_{\max}$, $k = 1,\ldots,k_{\max} = 30$, and the
Dirichlet distributions on weights have hyperparameters $\alpha_{jk} = 1$. They report
the reversible jump MCMC estimate of $\pi(k|x)$ contained in Table 5.
Since the prior distribution of k is uniform, the marginal likelihoods are
proportional to the posterior of k. Substituting the above estimates of $\pi(k|x)$
for the $f_t$'s in (35), after disregarding the estimate for $k \ge 16$, produces, up
to a proportionality constant, the $f_k^*$'s implicit in Richardson and Green’s
estimate. These quantities are reported in Table 6. Three values of $f_k^*$ are
negative, for k = 12, 13 and 15. However, these violations are rather slight,
almost within rounding error, and occur for values of k that account for little
posterior probability and are, therefore, more difficult to estimate. Thus, if
anything, the check gives support to Richardson and Green’s estimate.
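The consistency check just described can be sketched as follows. This is an illustration under stated assumptions, not the author's code: (35) is taken to be $f_k^* = \sum_{t=1}^{k} (-1)^{k+t}\binom{k}{t} a_{kt} f_t$ with $a_{kt} = \Gamma(k\alpha)\Gamma(t\alpha+n)/\{\Gamma(k\alpha+n)\Gamma(t\alpha)\}$, the galaxy data are taken to have n = 82 observations, and the function and variable names are hypothetical. The input is the estimate of $\pi(k|x)$ for k = 1,...,15 reported in Table 5.

```python
from math import comb, exp, lgamma


def a(k, t, n, alpha):
    """a_{kt} = Gamma(k*alpha) Gamma(t*alpha + n) / (Gamma(k*alpha + n) Gamma(t*alpha))."""
    return exp(lgamma(k * alpha) + lgamma(t * alpha + n)
               - lgamma(k * alpha + n) - lgamma(t * alpha))


def no_empty_portions(f, n, alpha):
    """Apply (35): f*_k = sum_{t<=k} (-1)^(k+t) C(k,t) a_{kt} f_t, k = 1..len(f).

    f[k-1] holds f(x|k), possibly known only up to a common constant.
    """
    return [sum((-1) ** (k + t) * comb(k, t) * a(k, t, n, alpha) * f[t - 1]
                for t in range(1, k + 1))
            for k in range(1, len(f) + 1)]


# Richardson and Green (1997) estimate of pi(k|x), k = 1..15 (Table 5).
# With a uniform prior on k these are proportional to the marginal likelihoods.
RG_POSTERIOR = [0.000, 0.000, 0.061, 0.128, 0.182, 0.199, 0.160, 0.109,
                0.071, 0.040, 0.023, 0.013, 0.006, 0.003, 0.002]

f_star = no_empty_portions(RG_POSTERIOR, n=82, alpha=1.0)
violations = [k for k, value in enumerate(f_star, start=1) if value < 0]

if __name__ == "__main__":
    for k, value in enumerate(f_star, start=1):
        print(k, round(value, 4))
    print("k with negative f*_k:", violations)
```

Because the inputs are rounded to three decimals, the values for large k are differences of nearly equal quantities and should be read with the same caution the text applies to the slight violations.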


Table 4

Bounds on $\pi(k|x)$ for several sample sizes n, $\pi(k) = 1/k_{\max}$, $k = 1,\ldots,k_{\max} = 50$, $\alpha = 0.5$

                                            k
  n       1       2       3       4       5       6       7       8       9       10
  20    0.7342  0.4684  0.2734  0.2575  0.1863  0.1783  0.1449  0.1343  0.1202  0.1030
  50    0.8354  0.6477  0.4709  0.3229  0.2983  0.2618  0.2096  0.2047  0.1782  0.1664
 100    0.8847  0.7456  0.6032  0.4703  0.3546  0.3166  0.2972  0.2610  0.2236  0.2189
 500    0.9491  0.8833  0.8090  0.7306  0.6515  0.5742  0.5006  0.4320  0.3691  0.3392
Table 5

Reversible jump MCMC estimate of $\pi(k|x)$ for the galaxy data set
reported by Richardson and Green (1997)

k            1      2      3      4      5      6      7      8
$\pi(k|x)$   0.000  0.000  0.061  0.128  0.182  0.199  0.160  0.109

k            9      10     11     12     13     14     15     ≥ 16
$\pi(k|x)$   0.071  0.040  0.023  0.013  0.006  0.003  0.002  0.003


Checking whether MCMC estimates of $f(x|k)$ or $\pi(k|x)$ satisfy the constraints
only makes marginal use of the information supplied by them. This
information can be more fully exploited by incorporating it in the estimation
procedure. For instance, one could estimate the $f_k^*$'s by MCMC methods and
then use (26) to transform those estimates into estimates of the marginal
likelihoods $f_k$. I will return to this point at the end of Section 4.4. Here
I only sketch some approaches to transform estimates of the $f_k^*$'s into
estimates that satisfy the inequalities (36).


Table 6

Estimates, up to a proportionality constant, of $f_k^*$ implicit in Richardson and
Green (1997) MCMC estimate of $\pi(k|x)$, galaxy data set

k         1        2        3        4        5        6        7        8
$f_k^*$   0.0000   0.0000   0.0610   0.1194   0.1532   0.1413   0.0792   0.0352

k         9        10       11       12       13       14       15
$f_k^*$   0.0167   0.0015   0.0035  -0.0005  -0.0008   0.0013  -0.0006


Table 7

Mode of (43), galaxy data, $\hat{f}$ is the Richardson and Green (1997)
estimate given in Table 5

k       1      2      3      4      5      6      7      8
$f_k$   0.000  0.000  0.061  0.128  0.181  0.198  0.160  0.109

k       9      10     11     12     13     14     15
$f_k$   0.071  0.041  0.023  0.013  0.007  0.003  0.002
Let $f = (f(x|2),\ldots,f(x|k_{\max}))^\top$ be the vector of marginal likelihoods of
the models with k components, $k = 2,\ldots,k_{\max}$. Also, let $\hat{f}$ be the corresponding
vector of MCMC estimates. When the mixture components parameters
have conjugate prior distributions, $f_1 = f(x|1)$ can be computed exactly; if
this is not the case, the vectors $f$ and $\hat{f}$ also include $f(x|1)$ and its estimate.
The estimates $\hat{f}$ might be directly available, as in the approaches of Nobile
(1994), Carlin and Chib (1995), Raftery (1996) and Roeder and Wasserman
(1997). Alternatively, they may only be computed up to a proportionality
constant, from the prior on the number of components and an estimate
$\hat{\pi}(k|x)$ of its posterior, as in the approaches of Phillips and Smith (1996),
Richardson and Green (1997) and Stephens (2000). In this latter case, the
constraint proceeding from $\sum_{k=1}^{k_{\max}} \hat{\pi}(k|x) = 1$ is disregarded. Estimates of the
variability of $\hat{f}$ can be computed, either by replicating the MCMC runs or
by using single run methods, such as batching and time series methods [see,
e.g., Chapter 6 of Ripley (1987) or Geyer (1992)]. It is assumed that as the
MCMC sample size increases, the distribution of $\hat{f}$ approaches a multivariate
normal

(42)    $\hat{\Sigma}^{-1/2}(\hat{f} - f) \to N(0, I)$,

where $\hat{\Sigma}$ is a consistent estimate of the variance-covariance matrix of $\hat{f}$. Let
R be the region where the constraints (36) are satisfied. If $\hat{f} \notin R$, an estimate
of $f$ which satisfies the constraints is the maximizer over R of the likelihood
$L(f)$ associated with (42). From a Bayesian viewpoint, this is equivalent to
using $\hat{\Sigma}$ as a plug-in estimate of $\Sigma$, employing the indicator $I_R(f)$ as the prior
distribution of $f$ and estimating $f$ by the mode of its posterior distribution,
which is proportional to

(43)    $\exp\{-\tfrac{1}{2}(f - \hat{f})^\top \hat{\Sigma}^{-1} (f - \hat{f})\}\, I_R(f)$.

The posterior mode is the point in R which is closest to $\hat{f}$ with respect to the
metric induced by $\hat{\Sigma}$. Hence, unless $\hat{f} \in R$, the mode will occur on the boundary
of R, where the multivariate normal contours are tangent to R. The maximization
of (43) is equivalent to the minimization of $(1/2) f^\top \hat{\Sigma}^{-1} f - \hat{f}^\top \hat{\Sigma}^{-1} f$
subject to $[b_2 : b_3 : \cdots : b_{k_{\max}}] f \ge -b_1 f_1$, where the vector $b_t$ has generic
entry $b_{kt} = (-1)^{k+t}\binom{k}{t} a_{kt} I(k \ge t)$, $t = 2,\ldots,k_{\max}$. This is a simple problem in
quadratic programming, for which software is publicly available; for instance,
Goodall (1995) provides a basic S-PLUS implementation. Table 7 contains
the $f$ which maximizes (43) with $\hat{f}$ equal to the estimates of Richardson and
Green (1997) given in Table 5.

Another estimate of $f$, which satisfies the constraints (36) and does not
lie on the boundary of R, is the mean of the distribution (43), which can be
estimated by averaging independent draws from the posterior (43). However,
drawing from the $N(\hat{f}, \hat{\Sigma})$ distribution and using a rejection technique can
be very inefficient, if R is in the tail of the distribution. When this occurs,
Gibbs sampling provides a more efficient alternative; working in terms of
the distribution of the $f_k^*$'s, a multivariate normal restricted to the positive
orthant, leads to full conditional distributions that are univariate normals
restricted to the positive reals. Table 8 contains an estimate of the posterior
mean computed from 20,000 draws from (43), obtained using rejection, with
$\hat{f}$ being Richardson and Green’s (1997) estimate for the galaxy data. On
the whole, the mean of (43) agrees with the estimate of Richardson and
Green (1997), although it tends to give some more weight to models with
a larger number of components. Table 9 displays the $f_k^*$'s corresponding to
the estimate of the mean of (43) given in Table 8. These estimates of the
$f_k^*$'s agree with those reported in Table 6 for values of k up to 9, then they
drop off much more regularly while remaining positive.

Table 8

Estimate of the mean of (43), galaxy data, $\hat{f}$ is the Richardson
and Green (1997) estimate given in Table 5. The estimate has
been rescaled in order that $\sum_k f_k = 1$

k       1      2      3      4      5      6      7      8
$f_k$   0.000  0.000  0.061  0.126  0.182  0.197  0.156  0.109

k       9      10     11     12     13     14     15     ≥ 16
$f_k$   0.069  0.040  0.023  0.013  0.008  0.005  0.003  0.008
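The Gibbs alternative mentioned above can be sketched generically. The block below is not the computation behind Table 8 (which used rejection); it is a minimal illustration, with hypothetical names, of Gibbs sampling from a multivariate normal restricted to the positive orthant. It assumes the normal is described by its mean vector and its precision matrix Q (the inverse of the covariance matrix), so that the full conditional of coordinate i is normal with variance $1/Q_{ii}$ and mean $\mu_i - Q_{ii}^{-1}\sum_{j \neq i} Q_{ij}(x_j - \mu_j)$, truncated to the positive reals. Truncated draws are obtained by inverting the normal distribution function.

```python
import random
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for cdf and inverse cdf


def draw_positive_normal(mean, sd, rng):
    """Draw from N(mean, sd^2) restricted to (0, inf) by inverting the cdf.

    If X = mean - sd * W with W standard normal, then X > 0 iff W < mean / sd,
    so W is drawn from the standard normal truncated above at mean / sd.
    """
    upper = _STD.cdf(mean / sd)
    p = (1.0 - rng.random()) * upper          # uniform on (0, upper]
    p = min(max(p, 1e-300), 1.0 - 1e-16)      # stay inside inv_cdf's open domain
    return max(mean - sd * _STD.inv_cdf(p), 1e-300)


def gibbs_positive_orthant(mean, precision, n_draws, burn_in=500, seed=0):
    """Gibbs sampler for N(mean, precision^{-1}) restricted to x_i > 0 for all i."""
    rng = random.Random(seed)
    d = len(mean)
    x = [max(m, 1.0e-3) for m in mean]        # feasible starting point
    draws = []
    for iteration in range(burn_in + n_draws):
        for i in range(d):
            shift = sum(precision[i][j] * (x[j] - mean[j])
                        for j in range(d) if j != i)
            cond_mean = mean[i] - shift / precision[i][i]
            cond_sd = precision[i][i] ** -0.5
            x[i] = draw_positive_normal(cond_mean, cond_sd, rng)
        if iteration >= burn_in:
            draws.append(list(x))
    return draws


if __name__ == "__main__":
    # Toy bivariate example: unit variances, correlation 0.5, mean partly outside the orthant.
    q = [[4 / 3, -2 / 3], [-2 / 3, 4 / 3]]
    sample = gibbs_positive_orthant([-0.5, 1.0], q, n_draws=5000)
    print([sum(row[i] for row in sample) / len(sample) for i in range(2)])
```

To apply this to (43) one would work with the $f_k^*$'s, whose unrestricted distribution is normal because they are a linear transformation of $f$; the mean vector and precision matrix passed to the sampler would then be those of the transformed vector.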

Table 9

Estimates of $f_k^*$ corresponding to the mean of (43) given in Table 8

k         1        2        3        4        5        6        7        8
$f_k^*$   0.0000   0.0000   0.0612   0.1180   0.1536   0.1395   0.0766   0.0370

k         9        10       11       12       13       14       15
$f_k^*$   0.0146   0.0033   0.0019   0.0007   0.0003   0.0002   0.0002

4.4. The number of nonempty components. Bayesian and classical analyses
of the same data may lead to widely contrasting conclusions about the
number of mixture components. A stylized account of a typical situation is
as follows: a classical analysis identifies $\hat{k}$ components as sufficient to provide
a good fit to the data. On the other hand, the posterior of the number of
components assigns considerable probability to values of $k > \hat{k}$. Moreover,
the posterior predictive distribution, conditional on k, of the next observation
remains essentially the same for all $k > \hat{k}$. Much of this divergence of
conclusions derives from the use of the same term, in the two approaches,
to denote different entities. In the Bayesian approach the parameter k denotes
the number of components in the mixture model, not the number of
components from which data are actually observed. It is instead this second
meaning that is attached to “number of components” in the classical
approach; accordingly, determining the number of components amounts to
finding k such that k mixture components afford a good fit of the data. The
difference between the two approaches can be highlighted by positing a very
small sample size, say n = 3; the classical approach will point at just one
component, while the posterior of k will be much the same as the prior.
In the Bayesian approach it is quite possible for the posterior of k to assign
much probability to values larger than the number of components from
which the data have originated. In fact, in Section 4.2 it was shown that,
for a certain prior distribution, when n = 82 the posterior probability of
three components is no larger than 0.8623, whatever the data are. This occurs
because the posterior probabilities of four and more components cannot
be too small, since they also account for allocation vectors with only three
nonempty mixture components. As noted in Section 4.2, the strength of this
link depends on the prior distribution of the mixture weights and it tends
to abate as the sample size increases. However, the usefulness of the posterior
of k, as a tool for selecting or estimating the number of components in
a mixture, tends to be put in question by the fact that it may, to a very
large extent, reflect probability mass associated with membership vectors
that allocate observations to fewer than k components.

In summary, while the classical approach addresses the question:

Q1. How many components are needed to fit the data well?

the posterior of k is suited to answer:

Q2. How many components are likely to be in the model that generated
the data?

While Q2 is concerned with the number of components in the mixture,
Q1 deals with the number of nonempty components. Since the Dirichlet
prior on the mixture weights determines how likely empty components are
to arise, it appears that the answer to Q2 depends on the prior specification
more than the answer to Q1. This section seeks to pursue in a Bayesian way
the objective of the classical approach, by deriving an expression for the
posterior distribution of the number of nonempty components.

Let h denote the number of nonempty components in the mixture. The
joint prior distribution of the number of components k and the membership
vectors g induces a prior on h. Since $h \le k$, one has

$f(h) = \sum_{k=h}^{\infty} \pi(k) f(h|k)$,    $h = 1,\ldots,n$.

Let $\mathcal{G}_k^h$ be the set of all membership vectors in $\mathcal{G}_k$ which assign observations
to exactly h components,

(44) Ql 

t=h 

Then the conditional distribution of h given k can be computed by summing
$f(g|k,\alpha_k)$ over this set:

(45)    $f(h|k) = \sum_{g \in \mathcal{G}_k^h} f(g|k,\alpha_k)$,    $h = 1,\ldots,k \wedge n$.

The following proposition provides a representation of $f(h|k)$ which makes
its computation feasible for sample sizes up to about 100; for larger sample
sizes an estimate can be obtained by stochastic simulation.


Proposition 4.2. Consider the model of Section 2 under Assumptions
A.1 and A.2 and suppose that all mixture components are equivalent. Let
$d = d(n_1,\ldots,n_h)$ be the number of distinct entries in the vector $(n_1,\ldots,n_h)^\top$;
also let $m_1,\ldots,m_d$ be the frequencies of the distinct $n_j$'s in $(n_1,\ldots,n_h)^\top$.
Then

(46)    $f(h|k) = \binom{k}{h} \frac{\Gamma(k\alpha)}{\Gamma(k\alpha+n)} \sum_{\substack{0 < n_1 \le \cdots \le n_h \\ n_1 + \cdots + n_h = n}} \binom{n}{n_1,\ldots,n_h} \binom{h}{m_1,\ldots,m_d} \prod_{j=1}^{h} \frac{\Gamma(\alpha+n_j)}{\Gamma(\alpha)}$,    $h = 1,\ldots,k \wedge n$.


Note that the sum in (46) does not involve k; this allows one to easily obtain
$f(h|k)$ with $k > h$ from $f(h|k)$ with $k = h$. Therefore, one only needs to
compute the sum in (46) at most n times. The total number of terms in these
n sums is the number p(n) of partitions of n into integer summands without
regard to order; tabulated values of p(n) are in Table 24.5 of Abramowitz
and Stegun (1964). Figure 1 contains a plot of the prior distribution of h
corresponding to the prior used by Richardson and Green (1997) for the
galaxy data. The computation was done in Fortran and took six minutes on
a PC with a 1.1 GHz processor.

Fig. 1. Prior distribution of the number h of nonempty components when n = 82,
$\pi(k) = 1/k_{\max}$, $k = 1,\ldots,k_{\max} = 30$ and $\alpha = 1$.
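A small-scale sketch of the computation in Proposition 4.2 follows. It is an illustration with hypothetical function names, not the Fortran program mentioned above, and it assumes equivalent components with a common $\alpha$. Each term of the sum over nondecreasing partitions $(n_1 \le \cdots \le n_h)$ of n combines the multinomial coefficient $n!/(n_1! \cdots n_h!)$, the number $h!/(m_1! \cdots m_d!)$ of distinct orderings, and the product of Gamma ratios; everything is accumulated on the log scale.

```python
from collections import Counter
from math import comb, exp, lgamma, log


def partitions(n, h, smallest=1):
    """Yield nondecreasing tuples of h positive integers that sum to n."""
    if h == 1:
        if n >= smallest:
            yield (n,)
        return
    for first in range(smallest, n // h + 1):
        for rest in partitions(n - first, h - 1, first):
            yield (first,) + rest


def prob_h_given_k(h, k, n, alpha):
    """f(h|k) from (46): prior probability of exactly h nonempty components out of k."""
    if h < 1 or h > min(k, n):
        return 0.0
    log_front = log(comb(k, h)) + lgamma(k * alpha) - lgamma(k * alpha + n)
    total = 0.0
    for parts in partitions(n, h):
        # log of n! / (n_1! ... n_h!): allocations of observations to the h groups
        log_term = lgamma(n + 1) - sum(lgamma(nj + 1) for nj in parts)
        # log of h! / (m_1! ... m_d!): distinct orderings of the group sizes
        log_term += lgamma(h + 1) - sum(lgamma(m + 1) for m in Counter(parts).values())
        # log of prod_j Gamma(alpha + n_j) / Gamma(alpha)
        log_term += sum(lgamma(alpha + nj) - lgamma(alpha) for nj in parts)
        total += exp(log_front + log_term)
    return total


def prior_h(h, prior, n, alpha):
    """f(h) = sum_{k >= h} pi(k) f(h|k), with `prior` mapping k to pi(k)."""
    return sum(p * prob_h_given_k(h, k, n, alpha) for k, p in prior.items() if k >= h)


if __name__ == "__main__":
    uniform = {k: 1 / 10 for k in range(1, 11)}
    print([round(prior_h(h, uniform, 20, 1.0), 4) for h in range(1, 11)])
```

For brevity the sketch recomputes the partition sum for every k, although, as noted above, that sum does not depend on k and could be cached per h. Pure Python is only practical for small n here, since the number of partitions grows quickly.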

The posterior distribution of the number of nonempty components can
be written as

(47)    $f(h|x) = \sum_{k=h}^{\infty} \pi(k|x) f(h|k,x)$,    $h = 1,\ldots,n$.

The following result provides a representation of the posterior of h in terms
of the $f_h^*$'s, the portions of the marginal likelihoods corresponding to no
empty components.


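Proposition 4.3. Consider the model of Section 2 under Assumptions
A.1 and A.2 and suppose that all mixture components are equivalent. Then

(48)    $f(h|x) = \frac{f_h^*}{f(x)} \sum_{k=h}^{\infty} \pi(k) \binom{k}{h} \frac{\Gamma(k\alpha)\,\Gamma(h\alpha+n)}{\Gamma(k\alpha+n)\,\Gamma(h\alpha)}$,    $h = 1,\ldots,n$.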
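Formula (48) can be evaluated directly once estimates of the $f_h^*$'s are available. The sketch below is illustrative only: it uses hypothetical names, plugs in the $f_h^*$ values of Table 9 with the uniform prior on {1,...,30}, $\alpha = 1$ and n = 82, and normalizes over h = 1,...,15, which amounts to assuming that $f_h^*$ is negligible for h > 15 (an assumption made here, not a statement from the text). The unknown factor $1/f(x)$ cancels in the normalization.

```python
from math import comb, exp, lgamma


def a(k, h, n, alpha):
    """Gamma(k*alpha) Gamma(h*alpha + n) / (Gamma(k*alpha + n) Gamma(h*alpha))."""
    return exp(lgamma(k * alpha) + lgamma(h * alpha + n)
               - lgamma(k * alpha + n) - lgamma(h * alpha))


def series(h, prior, n, alpha):
    """sum_{k >= h} pi(k) C(k,h) a_{kh}: the factor multiplying f*_h / f(x) in (48)."""
    return sum(p * comb(k, h) * a(k, h, n, alpha)
               for k, p in prior.items() if k >= h)


def posterior_nonempty(f_star, prior, n, alpha):
    """f(h|x) for h = 1..len(f_star), normalized to sum to one over that range."""
    weights = [fs * series(h, prior, n, alpha)
               for h, fs in enumerate(f_star, start=1)]
    total = sum(weights)
    return [w / total for w in weights]


# f*_h for h = 1..15 as reported in Table 9 (known up to a proportionality constant).
F_STAR_TABLE9 = [0.0000, 0.0000, 0.0612, 0.1180, 0.1536, 0.1395, 0.0766, 0.0370,
                 0.0146, 0.0033, 0.0019, 0.0007, 0.0003, 0.0002, 0.0002]

prior = {k: 1 / 30 for k in range(1, 31)}
post_h = posterior_nonempty(F_STAR_TABLE9, prior, n=82, alpha=1.0)

if __name__ == "__main__":
    for h, p in enumerate(post_h, start=1):
        print(h, round(p, 4))
```

The helper `series` is the same quantity that appears in the denominator of (39), so for h = 3 and a uniform prior it is the reciprocal of the bound 0.8623 divided by $k_{\max}$.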


Since the prior distribution of h is only specified indirectly, through the
priors on k and the mixture weights, one may prefer to consider, rather than
the posterior of h, the marginal likelihood $f(x|h)$ for h nonempty components.
This quantity is readily derived from (48):

$f(x|h) = \frac{f_h^*}{f(h)} \sum_{k=h}^{\infty} \pi(k) \binom{k}{h} \frac{\Gamma(k\alpha)\,\Gamma(h\alpha+n)}{\Gamma(k\alpha+n)\,\Gamma(h\alpha)}$.

Estimates of $f(x|h)$ are obtained by replacing the $f_h^*$'s with the estimates
produced in Section 4.3. Figure 2 displays estimates of $f(x|h)$, normalized
to sum to 1, along with normalized estimates of the marginal likelihoods
$f(x|k)$, for the galaxy data using the prior of Richardson and Green (1997).
As one would expect, the marginal likelihoods of the number of nonempty
components favor a smaller number of components than the posterior of
k, effectively narrowing the plausible range of normal components in the
observed data to between three and eight.

Fig. 2. Estimates of $f(x|k)$ and $f(x|h)$ for the galaxy data, both normalized to sum to 1.
Circles denote the estimate of $f(x|k)$ reported in Table 8; dots are the estimate of $f(x|h)$
obtained using the $f_k^*$'s given in Table 9.

As a conclusion, note that the path here followed from estimates of the
$f_k$'s to estimates of the $f_k^*$'s to estimates of $f(x|h)$ can also be travelled
in the opposite direction. For instance, it would be immediate to obtain
estimates of $f(h|x)$ using Richardson and Green’s (1997) reversible jump
algorithm. These could then be turned, using (48), into estimates, up to a
proportionality constant, of the $f_h^*$'s and finally estimates of the marginal
likelihoods $f_k$ automatically satisfying the constraints (36).
APPENDIX: PROOFS

Proof of Lemma 3.2. The inverse image under Mt of g G £h consists 
of all the g GGji which differ from g only in that the nonempty components 
in each class can be any of the components in the class that are smaller 
than t, rather than being the first ones. If m 7^ i{t), there are c(m, t) compo¬ 
nents in C{m) no larger than t, of which only hm are nonempty; this yields 
ways of selecting the nonempty components out of the c{m, t) candi¬ 
dates. As GhCGt, component t is nonempty; this leaves — 1 nonempty 
components to be selected among — 1 candidates in yielding 

possible selections. Multiplying together the numbers of possible 

i{t) 

selections in the N{t) classes yields (19). □ 

Proof of Lemma 3.3. Use in (9) the partition of Gt given in (17) to ob¬ 
tain /j* = J2heHt 9, at). Replace the inner sum with the 

expression in (20): f* = T,heHt^hT,ge£t, f{x\t,g,4>t)f{g\t,at). Next recall 
that £h C Gt and use C.l and C.2: ft = T,heHt g, 4>s)f{g\s, as). 

Then use again (20) and then (21) to produce ff = Y.h&m 'Egt^g^ f{x\s,g, (l)s)f{g\s, as) 

From the partition of Ht in (15) it follows that ff = 

G nl){-fi/Yh)fl where the 
second equality uses the relationship in (16). Now (22) follows since, for all 
h G I{hG Tit) — 0 implies that 7^ = 0. To see this consider h G Tit\Tit. 

Since s{h) = r, h Gfit would imply h Gfit contrary to the hypothesis; hence 
h^Tit and from Definition 3.7 7)^ = 0. □ 

Proof of Theorem 3.2. Substitute formula (22) in (11) to obtain fk = 
T,t=iO-ktT,i=iatrT,hGH^^ilh/'yh)fh- recall that aktatr = a-kr and inter¬ 

change the order of the two outer sums, fk = Er=i T,t=r T,heH^ilh/'Th)fh = 

T,r=i akr fhi^hh) T,Lr Ih' Finally, use (24) to produce (23). To prove (25) 

replace ff in (12) with the expression provided by (22) with t = k. □ 

Proof of Corollary 3.1. Part (i) follows from Lemma 3.1 and Theorem 3.2.
Equations (31) and (32) of part (ii) are obtained by replacing $a_{kh}$
in (26) and (27) with the expression given in (10) and using $\alpha_{0k} = k\alpha$. Part
(iii) follows straightforwardly from part (ii) with $\alpha = 1$. □

Proof of Lemma 3.4. Let $D = \{d_{kt}\}$ with $D = BC$. Then D is lower
triangular with generic element $d_{kt} = \sum_{r} b_{kr} c_{rt} = \sum_{r} (-1)^{r+t} b_{kr} b_{rt} =
\sum_{r=t}^{k} (-1)^{r+t} b_{kr} b_{rt}$, with the last equality holding since B is lower
triangular. It then immediately follows that D has unit diagonal elements, since
so has B. Therefore, it only remains to show that $d_{kt} = 0$, $k > t$. Now use
the definition of $b_{kt}$ and Condition C.2,

$d_{kt} = \sum_{r=t}^{k} (-1)^{r+t} \binom{k}{r} a_{kr} \binom{r}{t} a_{rt}$
$\phantom{d_{kt}} = a_{kt} \sum_{r=t}^{k} (-1)^{r+t} \frac{k!}{r!\,(k-r)!} \frac{r!}{t!\,(r-t)!}$
$\phantom{d_{kt}} = a_{kt} \binom{k}{t} \sum_{r=t}^{k} (-1)^{r+t} \binom{k-t}{k-r}$.

Next, change the summation index to $j = r - t$ to obtain

$d_{kt} = a_{kt} \binom{k}{t} \sum_{j=0}^{k-t} (-1)^{j+2t} \binom{k-t}{k-t-j} = a_{kt} \binom{k}{t} \sum_{j=0}^{k-t} (-1)^{j} \binom{k-t}{j} = 0$,

as the sum $\sum_j$ is null because of a basic property of binomial coefficients
[see, e.g., Abramowitz and Stegun (1964), page 10, Property 3.1.7]. □


Proof of Proposition 3.4. In the formula for $f_k$ given in (26) with
$k > n$, replace $f_t^*$ with the expression in (35) to produce

$f_k = \sum_{t=1}^{n} \binom{k}{t} a_{kt} \sum_{r=1}^{t} (-1)^{t+r} \binom{t}{r} a_{tr} f_r$
$\phantom{f_k} = \sum_{r=1}^{n} a_{kr} f_r \sum_{t=r}^{n} (-1)^{t+r} \binom{k}{t} \binom{t}{r}$
(49)    $\phantom{f_k} = \sum_{r=1}^{n} a_{kr} f_r \binom{k}{r} \sum_{t=r}^{n} (-1)^{t+r} \binom{k-r}{t-r}$.

Call S the inner sum and rewrite it by changing the summation index to
$j = t - r$, so that $(-1)^{t+r} = (-1)^{j}$:

(50)    $S = \sum_{j=0}^{n-r} (-1)^{j} \binom{k-r}{j}$.

Now, if $n - r$ is even, add $n - r - 2j$ to the exponent of $(-1)$. This leaves S
unchanged, so that $S = \sum_{j=0}^{n-r} (-1)^{n-r-j} \binom{k-r}{j} = \binom{k-r-1}{n-r}$, where the last equality
follows from a property of the binomial coefficients [see, e.g., Abramowitz
and Stegun (1964), Section 24.1.1, Relations II.B]. If $n - r$ is odd, premultiply
the sum $\sum_j$ in (50) by $-1$ and add $n - r - 2j$ to the exponent of $(-1)$,
yielding $S = -\binom{k-r-1}{n-r}$. Thus, in general, $S = (-1)^{n-r} \binom{k-r-1}{n-r}$. Finally,
substituting the above expression of S for the sum $\sum_{t=r}^{n}$ in the right-hand side of
(49) and changing the index from r to t yields (37). □
Proof of Proposition 4.1. Rewrite (39) as follows:

(51)    $\pi(k|x) = \sum_{h=1}^{k \wedge n} f_h^* d_h \Big/ \sum_{h=1}^{n} f_h^* b_h$,

where $d_h = \pi(k) \binom{k}{h} a_{kh}$ and $b_h = \sum_{t=h}^{\infty} \pi(t) \binom{t}{h} a_{th}$. It is immediate that a
maximizer has $f_h^* = 0$, $h > k \wedge n$, for otherwise $\pi(k|x)$ could be increased by
simply setting these components to 0 and leaving the other ones unchanged.
Suppose next that $\{f_h^*\}_{h=1}^{n}$ has at least two nonzero components: there exist
$t, r \in \{1,\ldots,k \wedge n\}$, $t \neq r$, such that $f_t^* \neq 0$, $f_r^* \neq 0$. Without loss of generality,
assume that

(52)    $\frac{b_r}{b_t} \ge \frac{d_r}{d_t}$.

Define a new vector $\{\tilde{f}_h\}_{h=1}^{n}$ with $\tilde{f}_t = f_t^* + (b_r/b_t) f_r^*$, $\tilde{f}_r = 0$, $\tilde{f}_h = f_h^*$, $h \neq t, r$.
One can easily verify that replacing $f_h^*$ with $\tilde{f}_h$ in the right-hand side of (51)
leaves the denominator unchanged, while (52) ensures that the numerator
does not decrease: $\sum_h \tilde{f}_h d_h \ge \sum_h f_h^* d_h$. Therefore one can replace $f_h^*$
with $\tilde{f}_h$ in (51), that is, select one of the nonzero components, set it to 0
and correspondingly adjust the other one, without decreasing $\pi(k|x)$. An
appeal to induction completes the proof. □



Proof of Proposition 4.2. Substituting in (45) $f(g|k,\alpha_k)$ from (3)
and using the fact that all components are equivalent, one obtains

$f(h|k) = \sum_{g \in \mathcal{G}_k^h} \frac{\Gamma(k\alpha)}{\Gamma(k\alpha+n)} \prod_{j=1}^{k} \frac{\Gamma(\alpha+n_j)}{\Gamma(\alpha)}$.

The sum is over vectors g with exactly h nonempty components, so only h
terms in the products are not equal to 1. Since the terms in the sum do not
depend on which components are nonempty, the sum is equal to $\binom{k}{h}$ times a
sum over $\mathcal{G}_h^h$, the subset of $\mathcal{G}_h$ comprising vectors which allocate observations
to all the h mixture components. Therefore,

$f(h|k) = \frac{\Gamma(k\alpha)}{\Gamma(k\alpha+n)} \binom{k}{h} \sum_{g \in \mathcal{G}_h^h} \prod_{j=1}^{h} \frac{\Gamma(\alpha+n_j)}{\Gamma(\alpha)}$.

The terms in the above sum depend on g only through $(n_1,\ldots,n_h)^\top$. Therefore
one can replace the sum over $\mathcal{G}_h^h$ with a sum over all partitions of the n
observations in h groups. Since to each partition $(n_1,\ldots,n_h)^\top$ there correspond
$\binom{n}{n_1,\ldots,n_h}$ membership vectors in $\mathcal{G}_h^h$, one has

$f(h|k) = \frac{\Gamma(k\alpha)}{\Gamma(k\alpha+n)} \binom{k}{h} \sum_{\substack{n_j > 0,\ j = 1,\ldots,h \\ n_1 + \cdots + n_h = n}} \binom{n}{n_1,\ldots,n_h} \prod_{j=1}^{h} \frac{\Gamma(\alpha+n_j)}{\Gamma(\alpha)}$.

Finally, since the terms in the sum are invariant to a change in the order
of the $n_j$'s, the sum above can be replaced by a sum over ordered $n_j$'s. As
to each ordered vector $(n_1,\ldots,n_h)^\top$ there correspond $\binom{h}{m_1,\ldots,m_d}$ unordered
ones, (46) follows. □


Proof of Proposition 4.3. The conditional distribution of h given
k and x in (47) can be obtained by summing the conditional distribution of
g given k and x over all membership vectors in $\mathcal{G}_k$ which allocate observations
to exactly h components: $f(h|k,x) = \sum_{g \in \mathcal{G}_k^h} \{f(x|k,g) f(g|k)\}/f(x|k)$.
Substituting this expression in (47) produces

$f(h|x) = \sum_{k=h}^{\infty} \frac{f(x|k)\,\pi(k)}{f(x)} \sum_{g \in \mathcal{G}_k^h} \frac{f(x|k,g)\, f(g|k)}{f(x|k)}$

(53)    $\phantom{f(h|x)} = \frac{1}{f(x)} \sum_{k=h}^{\infty} \pi(k) \sum_{g \in \mathcal{G}_k^h} f(x|k,g)\, f(g|k)$.

Consider now the inner sum in (53): 


k 

Y fix\k,g,4>k)fig\k,ak) = Y fix\k,g,4'k)fig\k,ak) 

g&Ql 


k 

= EE f{x\t,g,cl)t)fig\t,at)akt 


k 

= Y^ktih Y fix\t,g,(t^t)fig\t,at), 

t=h g££h 

where the first equality uses (44), the second one follows from Conditions 
C.l and C.2 and the third uses (20). Now, when all components are equiv¬ 
alent £h = Gh, so that using again Conditions C.l and C.2 one obtains 

k 

Y fix\k,g,(l)k)fig\k,ak) =Y(^kt'yh Y fix\h,g,(l)h)fig\h,ah)ath 

geg^ t=h g&g^l 







30 


A. NOBILE 


k 

— CLkhfh 7ft, 

t=h 

with the second equality following from formula (28). Since 7 ^ = it 

follows that the inner sum in (53) equals $\binom{k}{h} a_{kh} f_h^*$, so that

$f(h|x) = \frac{f_h^*}{f(x)} \sum_{k=h}^{\infty} \pi(k) \binom{k}{h} a_{kh}$.

As an aside, note that the series in the right-hand side was already met in
the denominator of (39). Substituting $a_{kh}$ with the expression in (10) and
using $\alpha_{0k} = k\alpha$ yields (48). □